(54) Method and apparatus for improving the efficiency of support vector machines

(57) A method and apparatus are described for improving the efficiency of any machine that uses an
algorithm that maps to a higher dimensional space in which a given set of vectors is used in a test
phase. In particular, reduced set vectors are used. These reduced set vectors are different from the
vectors in the set and are determined pursuant to an optimization approach other than the eigenvalue
computation used for homogeneous quadratic kernels. An illustrative embodiment is described in the
context of a support vector machine (SVM).
Description

Field of the Invention

This invention relates generally to universal learning machines, and, in particular, to support vector machines.

Background of the Invention 

A Support Vector Machine (SVM) is a universal learning machine whose decision surface is parameterized by a set
of support vectors, and by a set of corresponding weights. An SVM is also characterized by a kernel function. Choice
of the kernel determines whether the resulting SVM is a polynomial classifier, a two-layer neural network, a radial basis
function machine, or some other learning machine. A decision rule for an SVM is a function of the corresponding kernel
function and support vectors.

An SVM generally operates in two phases: a training phase and a testing phase. During the training phase, the set
of support vectors is generated for use in the decision rule. During the testing phase, decisions are made using the
particular decision rule. Unfortunately, in this latter phase, the complexity of computation for an SVM decision rule
scales with the number of support vectors, $N_S$, in the support vector set.

Summary of the Invention 

We have realized a method and apparatus for improving the efficiency of any machine that uses an algorithm that
maps to a higher dimensional space in which a given set of vectors is used in a test phase. In particular, and in
accordance with the principles of the invention, reduced set vectors are used. The number of reduced set vectors is
smaller than the number of vectors in the set. These reduced set vectors are different from the vectors in the set and
are determined pursuant to an optimization approach other than the eigenvalue computation used for homogeneous
quadratic kernels.

In an embodiment of the invention, an SVM, for use in pattern recognition, utilizes reduced set vectors, which
improves the efficiency of this SVM by a user-chosen factor. These reduced set vectors are determined pursuant to an
unconstrained optimization approach.

In accordance with a feature of the invention, the selection of the reduced set vectors allows direct control of
performance/complexity trade-offs.

In addition, the inventive concept is not specific to pattern recognition and is applicable to any problem where the
Support Vector algorithm is used (e.g., regression estimation).

Brief Description of the Drawing

FIG. 1 is a flow chart depicting the operation of a prior art SVM;

FIG. 2 is a general representation of the separation of training data into two classes with representative support
vectors;

FIG. 3 shows an illustrative method for training an SVM system in accordance with the principles of the invention;

FIG. 4 shows an illustrative method for operating an SVM system in accordance with the principles of the invention;
and

FIG. 5 shows a block diagram of a portion of a recognition system embodying the principles of the invention.

Detailed Description

Before describing an illustrative embodiment of the invention, a brief background is provided on support vector
machines, followed by a description of the inventive concept itself. Other than the inventive concept, it is assumed that
the reader is familiar with mathematical notation used to generally represent kernel-based methods as known in the art.
Also, the inventive concept is illustratively described in the context of pattern recognition. However, the inventive
concept is applicable to any problem where the Support Vector algorithm is used (e.g., regression estimation).

In the description below, it should be noted that test data was used from two optical character recognition (OCR)
data sets containing gray level images of the ten digits: a set of 7,291 training and 2,007 test patterns, which is referred
to herein as the "postal set" (e.g., see L. Bottou, C. Cortes, H. Drucker, L.D. Jackel, Y. LeCun, U.A. Müller, E. Säckinger,
P. Simard, and V. Vapnik, Comparison of Classifier Methods: A Case Study in Handwritten Digit Recognition,
Proceedings of the 12th IAPR International Conference on Pattern Recognition, Vol. 2, IEEE Computer Society Press,
Los Alamos, CA, pp. 77-83, 1994; and Y. LeCun, B. Boser, J.S. Denker, D. Henderson, R.E. Howard, W. Hubbard,
L.D. Jackel, Backpropagation Applied to Handwritten ZIP Code Recognition, Neural Computation, 1, 1989, pp. 541-551),
and a set of 60,000 training and 10,000 test patterns from NIST Special Database 3 and NIST Test Data 1, which is
referred to herein as the "NIST set" (e.g., see R.A. Wilkinson, J. Geist, S. Janet, P.J. Grother, C.J.C. Burges, R. Creecy,
R. Hammond, J.J. Hull, N.J. Larsen, T.R. Vogl and C.L. Wilson, The First Census Optical Character Recognition System
Conference, US Department of Commerce, NIST, August 1992). Postal images were 16x16 pixels and NIST images
were 28x28 pixels.

Background - Support Vector Machines

In the following, bold face is used for vector and matrix quantities, and light face for their components. 
Consider a two-class classifier for which the decision rule takes the form:

$$y = \Theta\left( \sum_{i=1}^{N_S} \alpha_i K(\mathbf{x}, \mathbf{s}_i) + b \right), \tag{1}$$

where $\mathbf{x}, \mathbf{s}_i \in R^d$, $\alpha_i, b \in R$, and $\Theta$ is the step function; $R^d$ is the
d-dimensional Euclidean space and $R$ is the real line; $\alpha_i$, $\mathbf{s}_i$, $N_S$ and $b$ are parameters and
$\mathbf{x}$ is the vector to be classified. The decision rule for a large family of classifiers can
be cast in this functional form: for example,



$$K = (\mathbf{x} \cdot \mathbf{s}_i)^p$$

implements a polynomial classifier;

$$K = \exp\left( -\|\mathbf{x} - \mathbf{s}_i\|^2 / \sigma^2 \right)$$

implements a radial basis function machine; and

$$K = \tanh\left( \gamma (\mathbf{x} \cdot \mathbf{s}_i) + \delta \right)$$

implements a two-layer neural network (e.g., see V. Vapnik, Estimation of Dependencies Based on Empirical Data,
Springer Verlag, 1982; V. Vapnik, The Nature of Statistical Learning Theory, Springer Verlag, 1995; B.E. Boser,
I.M. Guyon, and V. Vapnik, A training algorithm for optimal margin classifiers, Fifth Annual Workshop on Computational
Learning Theory, Pittsburgh, ACM, 144-152, 1992; and B. Schölkopf, C.J.C. Burges, and V. Vapnik, Extracting Support
Data for a Given Task, Proceedings of the First International Conference on Knowledge Discovery and Data Mining,
AAAI Press, Menlo Park, CA, 1995).
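
As an illustration only, the decision rule of Equation (1) and the three kernel families named above can be sketched in a few lines of Python. The function names, the default parameter values, and the choice to return 1 or 0 from the step function are assumptions made for this sketch; they are not part of the described apparatus.

```python
import math


def dot(u, v):
    """Euclidean dot product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def poly_kernel(x, s, p=3):
    """Polynomial kernel (x . s)^p; the degree p is an assumed default."""
    return dot(x, s) ** p


def rbf_kernel(x, s, sigma=1.0):
    """Radial basis function kernel exp(-||x - s||^2 / sigma^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, s))
    return math.exp(-sq_dist / sigma ** 2)


def tanh_kernel(x, s, gamma=1.0, delta=0.0):
    """Two-layer neural network kernel tanh(gamma (x . s) + delta)."""
    return math.tanh(gamma * dot(x, s) + delta)


def decide(x, support_vectors, alphas, b, kernel):
    """Equation (1): step function of the weighted kernel sum plus bias.

    The cost is one kernel evaluation per support vector, so it grows
    linearly with the number of support vectors.
    """
    total = sum(a * kernel(x, s) for a, s in zip(alphas, support_vectors)) + b
    return 1 if total > 0 else 0
```

Swapping the `kernel` argument changes the type of learning machine while the decision rule itself stays the same.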

The support vector algorithm is a principled method for training any learning machine whose decision rule takes
the form of Equation (1): the only condition required is that the kernel K satisfy a general positivity constraint (e.g., see
The Nature of Statistical Learning Theory, and A training algorithm for optimal margin classifiers, cited above). In
contrast to other techniques, the SVM training process determines the entire parameter set $\{\alpha_i, \mathbf{s}_i, N_S, b\}$;
the resulting $\mathbf{s}_i$, $i = 1, \ldots, N_S$ are a subset of the training set and are called support vectors.

Support Vector Machines have a number of striking properties. The training procedure amounts to solving a
constrained quadratic optimization problem, and the solution found is thus guaranteed to be the unique global minimum of
the objective function. SVMs can be used to directly implement Structural Risk Minimization, in which the capacity of
the learning machine can be controlled so as to minimize a bound on the generalization error (e.g., see The Nature of
Statistical Learning Theory, and Extracting Support Data for a Given Task, cited above). A support vector decision
surface is actually a linear separating hyperplane in a high dimensional space; similarly, SVMs can be used to construct a
regression, which is linear in some high dimensional space (e.g., see The Nature of Statistical Learning Theory, cited
above).

Support Vector Learning Machines have been successfully applied to pattern recognition problems such as optical
character recognition (OCR) (e.g., see The Nature of Statistical Learning Theory, and Extracting Support Data for a
Given Task, cited above, and C. Cortes and V. Vapnik, Support Vector Networks, Machine Learning, Vol. 20, pp. 1-25,
1995), and object recognition.

FIG. 1 is a flow chart depicting the operation of a prior art SVM. This operation comprises two phases: a training
phase and a testing phase. In the training phase, the SVM receives elements of a training set with pre-assigned classes
in step 52. In step 54, the input data vectors from the training set are transformed into a multi-dimensional space. In
step 56, parameters (i.e., support vectors and associated weights) are determined for an optimal multi-dimensional
hyperplane.

FIG. 2 shows an example where the training data elements are separated into two classes, one class represented
by circles and the other class represented by boxes. This is typical of a 2-class pattern recognition problem: for
example, an SVM which is trained to separate patterns of "cars" from those patterns that are "not cars." An optimal
hyperplane is the linear decision function with maximal margin between the vectors of two classes. That is, the optimal
hyperplane is the unique decision surface which separates the training data with a maximal margin. As illustrated in
FIG. 2, the optimal hyperplane is defined by the area where the separation between the two classes is maximum. As
observed in FIG. 2, to construct an optimal hyperplane, one only has to take into account a small subset of the trained
data elements, which determine this maximal margin. This subset of training elements that determines the parameters
of an optimal hyperplane is known as the support vectors. In FIG. 2, the support vectors are indicated by shading.

The optimal hyperplane parameters are represented as linear combinations of the mapped support vectors in the
high dimensional space. The SVM algorithm ensures that errors on a set of vectors are minimized by assigning weights 
to all of the support vectors. These weights are used in computing the decision surface in terms of the support vectors. 
The algorithm also allows for these weights to adapt in order to minimize the error rate on the training data belonging 
to a particular problem. These weights are calculated during the training phase of the SVM. 

Constructing an optimal hyperplane therefore becomes a constrained quadratic optimization programming problem 
determined by the elements of the training set and functions determining the dot products in the mapped space. The 
solution to the optimization problem is found using conventional intermediate optimization techniques. 

Typically, the optimal hyperplane involves separating the training data without any errors. However, in some cases,
training data cannot be separated without errors. In these cases, the SVM attempts to separate the training data with a
minimal number of errors and separates the rest of the elements with maximal margin. These hyperplanes are generally
known as soft margin hyperplanes.

In the testing phase, the SVM receives elements of a testing set to be classified in step 62. The SVM then
transforms the input data vectors of the testing set by mapping them into a multi-dimensional space using support vectors
as parameters in the Kernel (step 64). The mapping function is determined by the choice of kernel which is preloaded
in the SVM. The mapping involves taking a single vector and transforming it to a high-dimensional feature space so that
a linear decision function can be created in this high dimensional feature space. Although the flow chart of FIG. 1 shows
implicit mapping, this mapping may be performed explicitly as well. In step 66, the SVM generates a classification signal
from the decision surface to indicate the membership status of each input data vector. The final result is the creation of
an output classification signal, e.g., as illustrated in FIG. 2, a (+1) for a circle and a (-1) for a box.

Unfortunately, the complexity of the computation for Equation (1) scales with the number of support vectors $N_S$.
The expectation of the number of support vectors is bounded below by $(l - 1)E[P]$, where $P$ is the probability of error on
a test vector using a given SVM trained on $l$ training samples, and $E[P]$ is the expectation of $P$ over all choices of the $l$
samples (e.g., see The Nature of Statistical Learning Theory, cited above). Thus $N_S$ can be expected to approximately
scale with $l$. For practical pattern recognition problems, this results in a machine which is considerably slower in test
phase than other systems with similar generalization performance (e.g., see Comparison of Classifier Methods: A Case
Study in Handwritten Digit Recognition, cited above; and Y. LeCun, L. Jackel, L. Bottou, A. Brunot, C. Cortes, J. Denker,
H. Drucker, I. Guyon, U. Müller, E. Säckinger, P. Simard, and V. Vapnik, Comparison of Learning Algorithms for
Handwritten Digit Recognition, International Conference on Artificial Neural Networks, Ed. F. Fogelman, P. Gallinari,
pp. 53-60, 1995).

Reduced Set Vectors 

Therefore, and in accordance with the principles of the invention, we present a method and apparatus to
approximate the SVM decision rule with a much smaller number of reduced set vectors. The reduced set vectors have the
following properties:

• They appear in the approximate SVM decision rule in the same way that the support vectors appear in the full SVM
decision rule;

• They are not support vectors: they do not necessarily lie on the separating margin, and unlike support vectors, they
are not training samples;

• They are computed for a given, trained SVM;

• The number of reduced set vectors (and hence the speed of the resulting SVM in test phase) is chosen a priori;

• The reduced set method is applicable wherever the support vector method is used (for example, regression
estimation).
The Reduced Set

Let the training data be elements $\mathbf{x} \in L$, where $L$ (for "low dimensional") is defined to be the
$d_L$-dimensional Euclidean space $R^{d_L}$.

An SVM performs an implicit mapping $\Phi : \mathbf{x} \mapsto \bar{\mathbf{x}}$, $\bar{\mathbf{x}} \in H$ (for "high dimensional"), similarly

$$H = R^{d_H}, \quad d_H \le \infty.$$

In the following, vectors in $H$ will be denoted with a bar. The mapping $\Phi$ is determined by the choice of kernel $K$. In fact,
for any $K$ which satisfies Mercer's positivity constraint (e.g., see The Nature of Statistical Learning Theory, and A
training algorithm for optimal margin classifiers, cited above), there exists a pair $\{\Phi, H\}$ for which

$$K(\mathbf{x}_i, \mathbf{x}_j) = \bar{\mathbf{x}}_i \cdot \bar{\mathbf{x}}_j.$$

Thus in $H$, the SVM decision rule is simply a linear separating hyperplane (as noted above). The mapping $\Phi$ is usually
not explicitly computed, and the dimension $d_H$ of $H$ is usually large (for example, for the homogeneous map

$$K(\mathbf{x}_i, \mathbf{x}_j) = (\mathbf{x}_i \cdot \mathbf{x}_j)^p, \quad d_H = \binom{p + d_L - 1}{p}$$

(the number of ways of choosing $p$ objects from $p + d_L - 1$ objects; thus for degree 4 polynomials and for $d_L = 255$, $d_H$
is approximately 180 million).
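
The size of $d_H$ quoted above can be checked directly from the binomial coefficient. The short sketch below assumes Python 3.8 or later (for `math.comb`); the function name is chosen here for illustration.

```python
import math


def feature_space_dim(d_l, p):
    """Number of degree-p monomials in d_l variables: C(p + d_l - 1, p)."""
    return math.comb(p + d_l - 1, p)


# Degree 4 polynomial on 255-dimensional inputs, as in the text.
print(feature_space_dim(255, 4))
```

The result is a little over 180 million, which is why the mapping is left implicit and only kernel values are computed.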

The basic SVM pattern recognition algorithm solves a two-class problem (e.g., see Estimation of Dependencies
Based on Empirical Data, The Nature of Statistical Learning Theory, A training algorithm for optimal margin classifiers,
cited above). Given training data $\mathbf{x} \in L$ and corresponding class labels $y_i \in \{-1, 1\}$, the SVM algorithm constructs a
decision surface $\bar{\Psi} \in H$ which separates the $\mathbf{x}_i$ into two classes ($i = 1, \ldots, l$):

$$\bar{\Psi} \cdot \bar{\mathbf{x}}_i + b \ge k_0 - \xi_i, \quad y_i = +1 \tag{2}$$

$$\bar{\Psi} \cdot \bar{\mathbf{x}}_i + b \le k_1 + \xi_i, \quad y_i = -1 \tag{3}$$

where the $\xi_i$ are positive slack variables, introduced to handle the non-separable case (e.g., see Support Vector
Networks, cited above). In the separable case, the SVM algorithm constructs that separating hyperplane for which the
margin between the positive and negative examples in $H$ is maximized. A test vector $\mathbf{x} \in L$ is then assigned a class label
$\{+1, -1\}$ depending on whether

$$\bar{\Psi} \cdot \Phi(\mathbf{x}) + b$$

is greater or less than $(k_0 + k_1)/2$. A support vector $\mathbf{s} \in L$ is defined as any training sample for which one of the
equations (2) or (3) is an equality. (The support vectors are named $\mathbf{s}$ to distinguish them from the rest of the training data).
$\bar{\Psi}$ is then given by

$$\bar{\Psi} = \sum_{a=1}^{N_S} \alpha_a y_a \Phi(\mathbf{s}_a) \tag{4}$$

where $\alpha_a \ge 0$ are the weights, determined during training, $y_a \in \{+1, -1\}$ the class labels of the $\mathbf{s}_a$, and $N_S$ is the
number of support vectors. Thus in order to classify a test point $\mathbf{x}$ one computes

$$\bar{\Psi} \cdot \bar{\mathbf{x}} = \sum_{a=1}^{N_S} \alpha_a y_a \bar{\mathbf{s}}_a \cdot \bar{\mathbf{x}} = \sum_{a=1}^{N_S} \alpha_a y_a K(\mathbf{s}_a, \mathbf{x}). \tag{5}$$



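However, and in accordance with the inventive concept, consider now a set $\mathbf{z}_a \in L$, $a = 1, \ldots, N_Z$ and
corresponding weights $\gamma_a \in R$ for which

$$\bar{\Psi}' \equiv \sum_{a=1}^{N_Z} \gamma_a \Phi(\mathbf{z}_a) \tag{6}$$

minimizes (for fixed $N_Z$) the distance measure

$$\rho = \|\bar{\Psi} - \bar{\Psi}'\|. \tag{7}$$

As defined herein, the $\{\gamma_a, \mathbf{z}_a\}$, $a = 1, \ldots, N_Z$ are called the reduced set. To classify a test point $\mathbf{x}$, the
expansion in Equation (5) is replaced by the approximation

$$\bar{\Psi}' \cdot \bar{\mathbf{x}} = \sum_{a=1}^{N_Z} \gamma_a \bar{\mathbf{z}}_a \cdot \bar{\mathbf{x}} = \sum_{a=1}^{N_Z} \gamma_a K(\mathbf{z}_a, \mathbf{x}). \tag{8}$$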



The goal is then to choose the smallest $N_Z \ll N_S$, and corresponding reduced set, such that any resulting loss in
generalization performance remains acceptable. Clearly, by allowing $N_Z = N_S$, $\rho$ can be made zero; there are non-trivial
cases where $N_Z < N_S$ and $\rho = 0$ (described below). In those cases the reduced set leads to a reduction in the decision
rule complexity with no loss in generalization performance. If for each $N_Z$ one computes the corresponding reduced set,
$\rho$ may be viewed as a monotonic decreasing function of $N_Z$, and the generalization performance also becomes a
function of $N_Z$. In this description, only empirical results are provided regarding the dependence of the generalization
performance on $N_Z$.
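
Because $\bar{\Psi}$ and $\bar{\Psi}'$ are both kernel expansions, the squared distance $\rho^2$ of Equation (7) can be evaluated with kernel values alone, without ever forming vectors in $H$. The following Python sketch shows one way to do so; the function names, the argument layout (with `betas[a]` standing for the product $\alpha_a y_a$) and the inhomogeneous cubic kernel are assumptions made for illustration.

```python
def dot(u, v):
    """Euclidean dot product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def cubic_kernel(u, v):
    """Assumed example kernel: inhomogeneous polynomial of degree 3."""
    return (dot(u, v) + 1.0) ** 3


def rho_squared(svs, betas, zs, gammas, kernel):
    """||Psi - Psi'||^2 expanded into kernel evaluations.

    svs, betas  : support vectors and their signed weights alpha_a * y_a
    zs, gammas  : reduced set vectors and their weights
    """
    ss = sum(bi * bj * kernel(si, sj)
             for bi, si in zip(betas, svs)
             for bj, sj in zip(betas, svs))
    sz = sum(b * g * kernel(s, z)
             for b, s in zip(betas, svs)
             for g, z in zip(gammas, zs))
    zz = sum(gi * gj * kernel(zi, zj)
             for gi, zi in zip(gammas, zs)
             for gj, zj in zip(gammas, zs))
    return ss - 2.0 * sz + zz
```

Passing the support set itself as the reduced set gives a distance of zero up to rounding, which is the trivial case $N_Z = N_S$ mentioned above.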

The following should be noted about the mapping $\Phi$. The image of $\Phi$ will not in general be a linear space. $\Phi$ will
also in general not be surjective, and may not be one-to-one (for example, when $K$ is a homogeneous polynomial of
even degree). Further, $\Phi$ can map linearly dependent vectors in $L$ onto linearly independent vectors in $H$ (for example,
when $K$ is an inhomogeneous polynomial). In general one cannot scale the coefficients $\gamma_a$ to unity by scaling $\mathbf{z}_a$, even
when $K$ is a homogeneous polynomial (for example, if $K$ is homogeneous of even degree, the $\gamma_a$ can be scaled to
$\{+1, -1\}$, but not to unity).



In this Section, the problem of computing the minimum of $\rho$ analytically is considered. A simple, but non-trivial, case
is first described.

Homogeneous Quadratic Polynomials 

For homogeneous degree two polynomials, choosing a normalization of one:

$$K(\mathbf{x}_i, \mathbf{x}_j) = (\mathbf{x}_i \cdot \mathbf{x}_j)^2. \tag{9}$$

To simplify the exposition, the first order approximation, $N_Z = 1$, is computed. Introducing the symmetric tensor

$$S_{\mu\nu} \equiv \sum_{i=1}^{N_S} \alpha_i y_i s_{i\mu} s_{i\nu}, \tag{10}$$

it can be found that

$$\rho = \|\bar{\Psi} - \gamma \bar{\mathbf{z}}\|$$

is minimized for $\{\gamma, \mathbf{z}\}$ satisfying

$$S_{\mu\nu} z_\nu = \gamma z^2 z_\mu, \tag{11}$$

(repeated indices are assumed summed). With this choice of $\{\gamma, \mathbf{z}\}$, $\rho^2$ becomes

$$\rho^2 = S_{\mu\nu} S^{\mu\nu} - \gamma^2 z^4. \tag{12}$$

The largest drop in $\rho$ is thus achieved when $\{\gamma, \mathbf{z}\}$ is chosen such that $\mathbf{z}$ is that eigenvector of $S$ whose eigenvalue
$\lambda = \gamma z^2$ has largest absolute size. Note that $\gamma$ can be chosen so that $\gamma = \mathrm{sign}\{\lambda\}$, and $\mathbf{z}$ scaled so that $z^2 = |\lambda|$.
Extending to order $N_Z$, it can similarly be shown that the $\mathbf{z}_i$ in the set $\{\gamma_i, \mathbf{z}_i\}$ that minimize

$$\rho = \left\| \bar{\Psi} - \sum_{a=1}^{N_Z} \gamma_a \bar{\mathbf{z}}_a \right\| \tag{13}$$

are eigenvectors of $S$, each with eigenvalue $\gamma_i \|\mathbf{z}_i\|^2$. This gives

$$\rho^2 = S_{\mu\nu} S^{\mu\nu} - \sum_{a=1}^{N_Z} \gamma_a^2 \|\mathbf{z}_a\|^4 \tag{14}$$

and the drop in $\rho$ is maximized if the $\mathbf{z}_a$ are chosen to be the first $N_Z$ eigenvectors of $S$, where the eigenvectors
are ordered by absolute size of their eigenvalues. Note that, since $\mathrm{trace}(S^2)$ is the sum of the squared eigenvalues of
$S$, by choosing $N_Z = d_L$ (the dimension of the data) the approximation becomes exact, i.e., $\rho = 0$. Since the number of
support vectors $N_S$ is often larger than $d_L$, this shows that the size of the reduced set can be smaller than the number
of support vectors, with no loss in generalization performance.
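
The eigenvector construction for the quadratic kernel can be illustrated on two-dimensional data, where the symmetric tensor $S$ is a 2x2 matrix that can be diagonalized in closed form. The sketch below is restricted to $d_L = 2$ for brevity; the function names and the closed-form diagonalization by a single rotation are choices made here, and `betas[i]` again stands for $\alpha_i y_i$.

```python
import math


def second_moment_tensor(svs, betas):
    """Entries of S = sum_i beta_i s_i s_i^T for 2-D support vectors."""
    a = sum(b * s[0] * s[0] for b, s in zip(betas, svs))
    off = sum(b * s[0] * s[1] for b, s in zip(betas, svs))
    c = sum(b * s[1] * s[1] for b, s in zip(betas, svs))
    return a, off, c


def eig_sym_2x2(a, off, c):
    """Eigenpairs of [[a, off], [off, c]] obtained from one plane rotation."""
    theta = 0.5 * math.atan2(2.0 * off, a - c)
    cs, sn = math.cos(theta), math.sin(theta)
    lam1 = a * cs * cs + 2.0 * off * cs * sn + c * sn * sn
    lam2 = a * sn * sn - 2.0 * off * cs * sn + c * cs * cs
    return [(lam1, (cs, sn)), (lam2, (-sn, cs))]


def reduced_set_quadratic(svs, betas, n_z):
    """First n_z reduced set pairs (gamma, z), ordered by |eigenvalue|."""
    pairs = eig_sym_2x2(*second_moment_tensor(svs, betas))
    pairs.sort(key=lambda pair: abs(pair[0]), reverse=True)
    reduced = []
    for lam, unit in pairs[:n_z]:
        scale = math.sqrt(abs(lam))          # so that z . z = |lambda|
        gamma = 1.0 if lam >= 0.0 else -1.0  # gamma = sign(lambda)
        reduced.append((gamma, (scale * unit[0], scale * unit[1])))
    return reduced


def decision_value(x, vectors, weights):
    """Sum of weight * (v . x)^2, i.e. Equation (5) or (8) for this kernel."""
    return sum(w * (v[0] * x[0] + v[1] * x[1]) ** 2
               for w, v in zip(weights, vectors))
```

With `n_z` equal to the input dimension (2 here), the reduced expansion reproduces the full decision value for any number of support vectors, which corresponds to the exact case $N_Z = d_L$ noted above.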

In the general case, in order to compute the reduced set, $\rho$ must be minimized over all $\{\gamma_a, \mathbf{z}_a\}$, $a = 1, \ldots, N_Z$
simultaneously. It is convenient to consider an incremental approach in which on the $i$th step, $\{\gamma_j, \mathbf{z}_j\}$, $j < i$ are held fixed
while $\{\gamma_i, \mathbf{z}_i\}$ is computed. In the case of quadratic polynomials, the series of minima generated by the incremental
approach also generates a minimum for the full problem. This result is particular to second degree polynomials and is a
consequence of the fact that the $\mathbf{z}_i$ are orthogonal (or can be so chosen).

Table 1, below, shows the reduced set size $N_Z$ necessary to attain a number of errors $E_Z$ on the test set, where $E_Z$
differs from the number of errors $E_S$ found using the full set of support vectors by at most one error, for a quadratic
polynomial SVM trained on the postal set. Clearly, in the quadratic case, the reduced set can offer a significant reduction
in complexity with little loss in accuracy. Note also that many digits have numbers of support vectors larger than $d_L = 256$,
presenting in this case the opportunity for a speed up with no loss in accuracy.
Table 1

          Support Vectors     Reduced Set
Digit     Ns       Es         Nz       Ez
0         292      15         10       16
1         95       9          6        9
2         415      28         22       29
3         403      26         14       27
4         375      35         14       34
5         421      26                  27
6         261      13         12       14
7         228      18         10       19
8         446      33         24       33
9         330      20         20       21



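To apply the reduced set method to an arbitrary support vector machine, the above analysis must be extended for a
general kernel. For example, for the homogeneous polynomial

$$K(\mathbf{x}_1, \mathbf{x}_2) = N (\mathbf{x}_1 \cdot \mathbf{x}_2)^n,$$

setting

$$\partial \rho / \partial z_{1\mu_1} = 0$$

to find the first pair $\{\gamma_1, \mathbf{z}_1\}$ in the incremental approach gives an equation analogous to Equation (11):

$$S_{\mu_1 \mu_2 \ldots \mu_n} z_{1\mu_2} z_{1\mu_3} \cdots z_{1\mu_n} = \gamma_1 \|\mathbf{z}_1\|^{2n-2} z_{1\mu_1} \tag{15}$$

where

$$S_{\mu_1 \mu_2 \ldots \mu_n} \equiv \sum_{m=1}^{N_S} \alpha_m y_m s_{m\mu_1} s_{m\mu_2} \cdots s_{m\mu_n}. \tag{16}$$

In this case, varying $\rho$ with respect to $\gamma$ gives no new conditions. Having solved Equation (15) for the first order
solution $\{\gamma_1, \mathbf{z}_1\}$, $\rho^2$ becomes

$$\rho^2 = S_{\mu_1 \mu_2 \ldots \mu_n} S^{\mu_1 \mu_2 \ldots \mu_n} - \gamma_1^2 \|\mathbf{z}_1\|^{2n}. \tag{17}$$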


One can then define

$$\tilde{S}_{\mu_1 \mu_2 \ldots \mu_n} \equiv S_{\mu_1 \mu_2 \ldots \mu_n} - \gamma_1 z_{1\mu_1} z_{1\mu_2} \cdots z_{1\mu_n} \tag{18}$$

in terms of which the incremental equation for the second order solution $\mathbf{z}_2$ takes the form of Equation (15), with
$S$, $\mathbf{z}_1$ and $\gamma_1$ replaced by $\tilde{S}$, $\mathbf{z}_2$ and $\gamma_2$, respectively. (Note that for polynomials of degree greater than 2, the $\mathbf{z}_a$ will not
in general be orthogonal). However, these are only the incremental solutions: one still needs to solve the coupled
equations where all $\{\gamma_a, \mathbf{z}_a\}$ are allowed to vary simultaneously. Moreover, these equations will have multiple solutions, most
of which will lead to local minima in $\rho$. Furthermore, other choices of $K$ will lead to other fixed point equations. While
solutions to Equation (15) could be found by iterating (i.e., by starting with arbitrary $\mathbf{z}$, computing a new $\mathbf{z}$ using Equation
(15), and repeating), the method described in the next Section proves more flexible and powerful.

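Unconstrained Optimization Approach

Provided the kernel $K$ has first derivatives defined, the gradients of the objective function $F \equiv \rho^2/2$ with respect to
the unknowns $\{\gamma_i, \mathbf{z}_i\}$ can be computed. For example, assuming that $K(\mathbf{s}_m, \mathbf{s}_n)$ is a function of the scalar $\mathbf{s}_m \cdot \mathbf{s}_n$:

$$\frac{\partial F}{\partial \gamma_k} = -\sum_{m=1}^{N_S} \alpha_m y_m K(\mathbf{s}_m \cdot \mathbf{z}_k) + \sum_{j=1}^{N_Z} \gamma_j K(\mathbf{z}_j \cdot \mathbf{z}_k) \tag{19}$$

$$\frac{\partial F}{\partial z_{k\mu}} = -\sum_{m=1}^{N_S} \gamma_k \alpha_m y_m K'(\mathbf{s}_m \cdot \mathbf{z}_k) s_{m\mu} + \sum_{j=1}^{N_Z} \gamma_j \gamma_k K'(\mathbf{z}_j \cdot \mathbf{z}_k) z_{j\mu} \tag{20}$$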


Therefore, and in accordance with the principles of the invention, a (possibly local) minimum can then be found 
using unconstrained optimization techniques. 
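
The objective $F = \rho^2/2$ and its gradients in Equations (19) and (20) translate directly into code. The sketch below assumes an inhomogeneous cubic kernel written as a function of the dot product; the helper names are illustrative, `betas[m]` stands for $\alpha_m y_m$, and `kk` is the index of the reduced set vector being differentiated. Any gradient-based minimizer could consume these functions.

```python
def dot(u, v):
    """Euclidean dot product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def k(u):
    """Assumed kernel as a function of the scalar u = s . z."""
    return (u + 1.0) ** 3


def k_prime(u):
    """Derivative of k with respect to its scalar argument."""
    return 3.0 * (u + 1.0) ** 2


def objective(svs, betas, zs, gammas):
    """F = rho^2 / 2 written with kernel evaluations only."""
    ss = sum(bi * bj * k(dot(si, sj))
             for bi, si in zip(betas, svs) for bj, sj in zip(betas, svs))
    sz = sum(b * g * k(dot(s, z))
             for b, s in zip(betas, svs) for g, z in zip(gammas, zs))
    zz = sum(gi * gj * k(dot(zi, zj))
             for gi, zi in zip(gammas, zs) for gj, zj in zip(gammas, zs))
    return 0.5 * (ss - 2.0 * sz + zz)


def grad_gamma(svs, betas, zs, gammas, kk):
    """Equation (19): derivative of F with respect to gamma_k."""
    zk = zs[kk]
    return (-sum(b * k(dot(s, zk)) for b, s in zip(betas, svs))
            + sum(g * k(dot(z, zk)) for g, z in zip(gammas, zs)))


def grad_z(svs, betas, zs, gammas, kk):
    """Equation (20): derivative of F with respect to each component of z_k."""
    zk, gk = zs[kk], gammas[kk]
    grad = []
    for mu in range(len(zk)):
        first = -sum(gk * b * k_prime(dot(s, zk)) * s[mu]
                     for b, s in zip(betas, svs))
        second = sum(g * gk * k_prime(dot(z, zk)) * z[mu]
                     for g, z in zip(gammas, zs))
        grad.append(first + second)
    return grad
```

Note that every term of `grad_z` carries a factor `gk`, so the gradient with respect to a reduced set vector vanishes whenever its weight is zero.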


The Algorithm 

First, the desired order of approximation, $N_Z$, is chosen. Let

$$X_i \equiv \{\gamma_i, \mathbf{z}_i\}.$$

A two-phase approach is used. In phase 1 (described below), the $X_i$ are computed incrementally, keeping all $\mathbf{z}_j$, $j < i$,
fixed.

In phase 2 (described below), all $X_i$ are allowed to vary.

It should be noted that the gradient in Equation (20) is zero if $\gamma_k$ is zero. This fact can lead to severe numerical
instabilities. In order to circumvent this problem, phase 1 relies on a simple "level crossing" theorem. The algorithm is as
follows. First, $\gamma_i$ is initialized to +1 or -1; $\mathbf{z}_i$ is initialized with random values. $\mathbf{z}_i$ is then allowed to vary, while keeping $\gamma_i$ fixed.
The optimal value for $\gamma_i$, given that $\mathbf{z}_i$, $X_j$, $j < i$ are fixed, is then computed analytically. $F$ is then minimized with respect
to both $\mathbf{z}_i$ and $\gamma_i$ simultaneously. Finally, the optimal $\gamma_j$ for all $j \le i$ are computed analytically, and are given by
$\Gamma = Z^{-1} \Delta$, where $\Gamma$, $\Delta$ and $Z$ are given by (see equation (19)):

$$\Gamma_j \equiv \gamma_j, \tag{21}$$

$$\Delta_j \equiv \sum_{a=1}^{N_S} \alpha_a y_a K(\mathbf{s}_a, \mathbf{z}_j), \text{ and} \tag{22}$$

$$Z_{jk} \equiv K(\mathbf{z}_j, \mathbf{z}_k). \tag{23}$$

Since $Z$ is positive definite and symmetric, it can be inverted efficiently using the well-known Choleski
decomposition.
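
The analytic weight update $\Gamma = Z^{-1} \Delta$ of Equations (21) to (23) amounts to solving a small symmetric positive definite linear system. A plain Python sketch of that step is given below; the radial basis function kernel with unit width, the function names, and the use of `betas[a]` for $\alpha_a y_a$ are assumptions made for illustration, and the code factors $Z$ and solves two triangular systems rather than forming the inverse explicitly.

```python
import math


def rbf(u, v):
    """Assumed example kernel: exp(-||u - v||^2), positive definite."""
    return math.exp(-sum((a - b) ** 2 for a, b in zip(u, v)))


def cholesky(a):
    """Lower-triangular factor L with a = L L^T for symmetric positive definite a."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            acc = a[i][j] - sum(low[i][m] * low[j][m] for m in range(j))
            low[i][j] = math.sqrt(acc) if i == j else acc / low[j][j]
    return low


def solve_cholesky(a, rhs):
    """Solve a x = rhs by forward then backward substitution."""
    low = cholesky(a)
    n = len(rhs)
    y = [0.0] * n
    for i in range(n):
        y[i] = (rhs[i] - sum(low[i][m] * y[m] for m in range(i))) / low[i][i]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (y[i] - sum(low[m][i] * x[m] for m in range(i + 1, n))) / low[i][i]
    return x


def optimal_gammas(svs, betas, zs, kernel):
    """Weights gamma that minimize F for fixed reduced set vectors zs."""
    z_mat = [[kernel(zj, zk) for zk in zs] for zj in zs]            # Equation (23)
    delta = [sum(b * kernel(s, zj) for b, s in zip(betas, svs))     # Equation (22)
             for zj in zs]
    return solve_cholesky(z_mat, delta)
```

Setting the gradient of Equation (19) to zero for every $k$ gives exactly the system $Z\Gamma = \Delta$ solved here.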

Thus, the first phase of the algorithm proceeds as follows: 



[1] choose $\gamma_1$ = +1 or -1 randomly, set $\mathbf{z}_1$ to a selection of random values;

[2] vary $\mathbf{z}_1$ to minimize $F$;

[3] compute the $\gamma_1$, keeping $\mathbf{z}_1$ fixed, that maximally further reduces $F$;

[4] allow $\mathbf{z}_1$, $\gamma_1$ to vary together to further reduce $F$;

[5] repeat steps [1] through [4] $T$ times keeping the best answer;

[6] fix $\mathbf{z}_1$, $\gamma_1$, choose $\gamma_2$ = +1 or -1 randomly, set $\mathbf{z}_2$ to a selection of random values;

[7] vary $\mathbf{z}_2$ to minimize $F$;

[8] then fixing $\mathbf{z}_2$ (and $\mathbf{z}_1$, $\gamma_1$) compute the optimal $\gamma_2$ that maximally further reduces $F$;

[9] then let $\{\mathbf{z}_2, \gamma_2\}$ vary together, to further reduce $F$;

[10] repeat steps [6] to [9] $T$ times, keeping the best answer; and

[11] finally, fixing $\mathbf{z}_1$, $\mathbf{z}_2$, compute the optimal $\gamma_1$, $\gamma_2$ (as shown above in equations (21) - (23)) that further
reduces $F$.

This procedure is then iterated with $\{\mathbf{z}_3, \gamma_3\}$ and $\{\mathbf{z}_4, \gamma_4\}$, and so on up to $\{\mathbf{z}_{N_Z}, \gamma_{N_Z}\}$.



Numerical instabilities are avoided by preventing $\gamma_i$ from approaching zero. The above algorithm ensures this
automatically: if the first step, in which $\mathbf{z}_i$ is varied while $\gamma_i$ is kept fixed, results in a decrease in the objective function $F$,
then when $\gamma_i$ is subsequently allowed to vary it cannot pass through zero, because doing so would require an increase in $F$
(since the contribution of $\{\mathbf{z}_i, \gamma_i\}$ to $F$ would then be zero).

Note that each computation of a given $\{\mathbf{z}_i, \gamma_i\}$ pair is repeated in phase 1 several ($T$) times, with different initial
values for the $X_i$. $T$ is determined heuristically from the number $M$ of different minima in $F$ found. For the above-mentioned
data sets, $M$ was usually 2 or 3, and $T$ was chosen as $T = 10$.

In phase 2, all vectors $X_i$ found in phase 1 are concatenated into a single vector, and the unconstrained
minimization process is then applied again, allowing all parameters to vary. It should be noted that phase 2 often results in
roughly a factor of two further reduction in the objective function $F$.

In accordance with the principles of the invention, the following first order unconstrained optimization method was
used for both phases. The search direction is found using conjugate gradients. Bracketing points $x_1$, $x_2$ and $x_3$ are
found along the search direction such that $F(x_1) > F(x_2) < F(x_3)$. The bracket is then balanced (for balancing
techniques, see, e.g., W.H. Press, S.A. Teukolsky, W.T. Vetterling and B.P. Flannery, Numerical Recipes in C, Second
Edition, Cambridge University Press, 1992). The minimum of the quadratic fit through these three points is then used as
the starting point for the next iteration. The conjugate gradient process is restarted after a fixed, chosen number of
iterations, and the whole process stops when the rate of decrease of F falls below a threshold. It should be noted that this
general approach gave the same results as the analytic approach when applied to the case of the quadratic polynomial
kernel, described above.
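
The quadratic-fit step of the line search can be written in closed form. The following sketch uses the standard three-point parabolic interpolation formula; the function name and the fallback for a degenerate (collinear) bracket are assumptions made here, and the bracketing and balancing steps themselves are not shown.

```python
def parabola_minimum(x1, f1, x2, f2, x3, f3):
    """Abscissa of the extremum of the parabola through three points.

    Intended for a bracket with f1 > f2 < f3, in which case the
    extremum is a minimum lying between x1 and x3.
    """
    num = (x2 - x1) ** 2 * (f2 - f3) - (x2 - x3) ** 2 * (f2 - f1)
    den = (x2 - x1) * (f2 - f3) - (x2 - x3) * (f2 - f1)
    if den == 0.0:
        return x2  # points are collinear; keep the middle point
    return x2 - 0.5 * num / den
```

For an objective that is exactly quadratic along the search direction, one such step lands on the line minimum; otherwise the step is repeated from the new point.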

Experiments

The above approach was applied to the SVM that gave the best performance on the postal set, which was a degree
3 inhomogeneous polynomial machine (for the latter see, e.g., The Nature of Statistical Learning Theory, cited above).
The order of approximation, $N_Z$, was chosen to give a factor of ten speed up in test phase for each two-class classifier.
The results are given in Table 2 (shown below). The reduced set method achieved the speed up with essentially no loss
in accuracy. Using the ten classifiers together as a ten-class classifier (for the latter, see, e.g., The Nature of Statistical
Learning Theory, and Support Vector Networks, cited above) gave 4.2% error using the full support set, as opposed to
4.3% using the reduced set. Note that for the combined case, the reduced set gives only a factor of six speed up, since
different two-class classifiers have some support vectors in common, allowing the possibility of caching. To address the
question as to whether these techniques can be scaled up to larger problems, the study was repeated for a two-class
classifier separating digit 0 from all other digits for the NIST set (60,000 training, 10,000 test patterns). This classifier
was also chosen to be that which gave best accuracy using the full support set: a degree 4 polynomial. The full set of
1,273 support vectors gave 19 test errors, while a reduced set of size 127 gave 20 test errors.





Table 2

          Support Vectors     Reduced Set
Digit     Ns       Es         Nz       Ez
0         272      13         27       13
1         109      9          11       10
2         380      26         38       26
3         418      20         42       20
4         392      34         39       32
5         397      21         40       22
6         257      11         26       11
7         214      14         21       13
8         463      26         46       28
9         387      13         39       13
Totals:   3289     187        329      188

(Note that tests were also done on the full 10 digit NIST giving a factor of 50 speedup with 10%
loss of accuracy; see C.J.C. Burges, B. Schölkopf, Improving the Accuracy and Speed of Support
Vector Machines, in press, NIPS '96.)



Illustrative Embodiment

Turning now to FIG. 3, an illustrative flow chart embodying the principles of the invention is shown for use in a
training phase of an SVM. Input training data is applied to an SVM (not shown) in step 100. The SVM is trained on this input
data in step 105 and generates a set of support vectors in step 110. A number of reduced set vectors is selected in step
135. In step 115, the unconstrained optimization approach (described above) is used to generate reduced set vectors
in step 120. These reduced set vectors are used to test a set of sample data (not shown) in step 125. Results from this
test are evaluated in step 130. If the test results are acceptable (e.g., as to speed and accuracy), then the reduced set
vectors are available for subsequent use. If the test results are not acceptable, then the process of determining the
reduced set vectors is performed again. (In this latter case, it should be noted that the test results (e.g., in terms of
speed and/or accuracy) could suggest a further reduction in the number of reduced set vectors.)

Once the reduced set vectors have been determined, they are available for use in an SVM. A method for using these
reduced set vectors in a testing phase is shown in FIG. 4. In step 215, input data vectors from a test set are applied to
the SVM. In step 220, the SVM transforms the input data vectors of the testing set by mapping them into a
multidimensional space using reduced set vectors as parameters in the Kernel. In step 225, the SVM generates a classification
signal from the decision surface to indicate the membership status of each input data vector.

As noted above, a number, m, of reduced set vectors are in the reduced set. These reduced set vectors are determined
in the above-mentioned training phase illustrated in FIG. 3. If the speed and accuracy data suggest that fewer than
m reduced set vectors can be used, an alternative approach can be taken that obviates the need to recalculate a new,
and smaller, set of reduced set vectors. In particular, a number of reduced set vectors, x, are selected from the set
of m reduced set vectors, where x < m. In this case, the determination of how many reduced set vectors, x, to use is
empirically determined, using, e.g., the speed and accuracy data generated in the training phase. However, there is no
need to recalculate the values of these reduced set vectors.
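One way to picture this empirical selection is sketched below. The text does not say which x of the m vectors to keep, so the sketch assumes, purely for illustration, that vectors with the largest absolute weight are kept first and that accuracy on a labeled sample set is the acceptance criterion. The kernel, the function names, and the data are likewise hypothetical; the variable `count` plays the role of x.

```python
import math

def rbf_kernel(u, v, width=0.5):
    """Gaussian kernel K(u, v) = exp(-width * ||u - v||^2) (assumed kernel)."""
    return math.exp(-width * sum((a - b) ** 2 for a, b in zip(u, v)))

def decide(x, vectors, weights, bias):
    """Sign of the weighted kernel sum plus bias."""
    score = sum(w * rbf_kernel(z, x) for w, z in zip(weights, vectors)) + bias
    return 1 if score >= 0 else -1

def accuracy(samples, labels, vectors, weights, bias):
    """Fraction of samples whose decision matches the label."""
    hits = sum(decide(x, vectors, weights, bias) == y for x, y in zip(samples, labels))
    return hits / len(samples)

def smallest_subset(samples, labels, vectors, weights, bias, min_accuracy):
    """Return the smallest count of existing reduced set vectors (with their
    original, unrecomputed values) that reaches min_accuracy, or None."""
    # Assumed ordering: keep the vectors with the largest |weight| first.
    order = sorted(range(len(vectors)), key=lambda a: -abs(weights[a]))
    for count in range(1, len(vectors) + 1):
        keep = order[:count]
        sub_vectors = [vectors[a] for a in keep]
        sub_weights = [weights[a] for a in keep]
        if accuracy(samples, labels, sub_vectors, sub_weights, bias) >= min_accuracy:
            return count, sub_vectors, sub_weights
    return None

# Hypothetical reduced set of m = 3 vectors; the third contributes very little.
vectors = [[0.0, 0.0], [2.0, 2.0], [5.0, 5.0]]
weights = [1.0, -1.0, 0.01]
samples = [[0.0, 0.1], [2.0, 2.1], [0.2, 0.0], [1.9, 2.0]]
labels = [1, -1, 1, -1]

count, sub_vectors, sub_weights = smallest_subset(
    samples, labels, vectors, weights, 0.0, min_accuracy=1.0)
print(count, sub_vectors)
```

The selected vectors and weights are reused exactly as they came out of training; only the number kept changes.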

An illustrative embodiment of the inventive concept is shown in FIG. 5 in the context of pattern recognition. Pattern
recognition system 100 comprises processor 105 and recognizer 110, which further comprises data capture element
115, and SVM 120. Other than the inventive concept, the elements of FIG. 5 are well-known and will not be described
in detail. For example, data input element 115 provides input data for classification to SVM 120. One example of data
input element 115 is a scanner. In this context, the input data are pixel representations of an image (not shown). SVM
120 operates on the input data in accordance with the principles of the invention using reduced set vectors. During
operation, or testing, SVM 120 provides a numerical result representing classification of the input data to processor 105
for subsequent processing. Processor 105 is representative of a stored-program-controlled processor such as a
microprocessor with associated memory. Processor 105 additionally processes the output signals of recognizer 110, such
as, e.g., in an automatic teller machine (ATM).

The system shown in FIG. 5 operates in two modes, a training mode and an operating (or test) mode. An illustration 
of the training mode is represented by the above-described method shown in FIG. 3. An illustration of the test mode is 
represented by the above-described method shown in FIG. 4. 

The foregoing merely illustrates the principles of the invention and it will thus be appreciated that those skilled in
the art will be able to devise numerous alternative arrangements.

For example, the inventive concept is also applicable to kernel-based methods other than support vector machines,
which can also be used for, but are not limited to, regression estimation, density estimation, etc.



1. A method for using a support vector machine, the method comprising the steps of:

receiving input data signals; and

using the support vector machine operable on the input data signals for providing an output signal, wherein the
support vector machine utilizes reduced set vectors, wherein the reduced set vectors were a priori determined
during a training phase using an optimization approach other than an eigenvalue computation used for
homogeneous quadratic kernels.

2. The method of claim 1 wherein the training phase further comprises the steps of:

receiving elements of a training set;

generating a set of support vectors, the number of support vectors being Ns,

selecting a number m of reduced set vectors, where m ≤ Ns, and

generating the number m of reduced set vectors using the unconstrained optimization approach.

3. The method of claim 1 wherein the optimization approach is an unconstrained optimization approach.

4. The method of claim 1 wherein the input data signals represent different patterns and the output signal represents 
a classification of the different patterns. 

5. The method of claim 1 wherein the training phase further comprises the steps of: 

training the support vector machine for determining a number, Ns, of support vectors; and

using an unconstrained optimization technique to determine the reduced set vectors, where a number of
reduced set vectors is m, where m ≤ Ns.

6. A support vector machine comprising: 

a data capture element for providing input data signals; and 

a support vector machine operable on the input data signals for providing at least one output data signal,
wherein the support vector machine operates on the input data signals using reduced set vectors determined a
priori using an optimization approach other than an eigenvalue computation used for homogeneous quadratic
kernels.

7. The apparatus of claim 6 wherein the data capture element provides input data signals representative of a plurality
of images applied to the data capture element.

8. The apparatus of claim 7 wherein the at least one output signal of the support vector machine is representative of
a classification of each image.

9. The apparatus of claim 6 where the number of reduced set vectors is less than a number of support vectors.

10. The apparatus of claim 6 wherein the optimization approach is an unconstrained optimization approach. 

11. The apparatus of claim 10 wherein the reduced set vectors are determined a priori while training the support vector
machine using the unconstrained optimization approach.